Rationale - the Data lake

  • The GRINS foundation aims at implementing a data platform for the transfer of knowledge and statistical analysis (AMELIA)

  • Raw material of the platform: the data lake

    • Broad repository hosting several categories of administrative data from different sources
    • Available to private, corporate or academic users
    • Data organised at the territorial level of municipalities (LAU/NUTS-4)
  • The present R package is intended as one of several contributions to the data lake

Rationale - the Data lake

  • This R package covers the dimension of public education, with special regard to the territorial structure of the education system.

  • Main utility: analysing territorial disparities in education quality and school infrastructure endowment

  • Directly supports areal modelling

Principles followed

  • Accessibility: All data must be publicly accessible and easy to handle for the generic user
    • Input data are open and come from publicly accessible web pages
  • Updating: All information is retrieved in real time in order to be up to date
    • Inputs are scraped from the web rather than stored in a built-in repository
  • Portability: All objects should be easy to export and process with different software:
    • We work in the tidyverse framework, and all outputs are structured as tibbles
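
As a minimal sketch of the portability principle, the snippet below writes a small table to CSV and reads it back using base R only. The table, its column names (`Municipality_code`, `n_schools`) and all values are invented for illustration and are not outputs of the package. Since tibbles are data frames, a tibble can be passed to `write.csv()` in the same way.

```r
# Hypothetical toy table standing in for a package output (invented values)
toy <- data.frame(
  Municipality_code = c("001001", "001002", "002001"),
  n_schools = c(12L, 7L, 3L),
  stringsAsFactors = FALSE
)

# Export to a temporary CSV file that other software can read
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)

# Read it back, keeping the codes as character to preserve leading zeros
back <- read.csv(out_file, colClasses = c("character", "integer"))
```

Reading the code column as character is a deliberate choice here: territorial codes with leading zeros would otherwise be converted to numbers and lose them.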

Main function modules

  • Get_: input data scraping. Information is not altered, and the user receives a data set as close as possible to what the provider releases

  • Util_: utilities; mainly data modification and editing

  • Group_: data aggregation at the relevant territorial level

    • NUTS-3/Province
    • LAU/Municipality
  • Map_: displaying

    • Static maps (vector format): easy to export
    • Interactive maps: preserve information on different variables
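
To illustrate what aggregation at a territorial level means in practice, here is a minimal base R sketch. It is not the package's Group_ implementation: the data frame, its column names and all values are hypothetical, and `aggregate()` is used only as a stand-in for the general idea of summing school-level records within each municipality or province.

```r
# Hypothetical school-level records (all codes and counts are invented)
schools <- data.frame(
  School_code = c("S1", "S2", "S3", "S4"),
  Municipality_code = c("001001", "001001", "001002", "002001"),
  Province_code = c("001", "001", "001", "002"),
  Students = c(120, 80, 200, 50)
)

# Aggregate to the municipality (LAU) level
by_mun <- aggregate(Students ~ Municipality_code, data = schools, FUN = sum)

# Aggregate to the province (NUTS-3) level
by_prov <- aggregate(Students ~ Province_code, data = schools, FUN = sum)
```

Both results have one row per territorial unit, which is the shape needed to join the data to a shapefile for mapping.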

Main datasets

  • Data from the Ministry of Education
    • Includes:
      • National Schools Registry
      • School Buildings database
      • Students and teachers counts
    • Mainly available at the school level (except for the count of teachers)
  • Ultra-Broadband implementation
    • Available at the school level
  • Invalsi census survey
    • Available at the NUTS-3 / LAU level

Schools Taxonomy

  • Schools ID - mechanographical codes
    • Most complete list: National Schools Registry
    • Identifies both school order and address (of high schools)
  • School buildings ID - typically numeric codes
    • Only included in the School buildings DB

School buildings database

  • Main source of information regarding the school infrastructure

  • Mostly includes categorical variables, regarding several aspects such as:

    • Environmental context
    • Reachability by public or private transport
    • Building period
    • Surfaces and volumes
  • As an example, in the next slides we display the surface area of middle schools (on a logarithmic scale to ease comparison) for the three regions of Apulia, Basilicata and Calabria.
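
A short sketch of why a logarithmic scale eases comparison: when values span several orders of magnitude, the log transform compresses them into a narrow range. The surface values below are invented and do not come from the School buildings database.

```r
# Hypothetical surface areas (invented values)
surface <- c(150, 900, 2500, 40000)

# Ratio between the largest and smallest value on the raw scale
raw_ratio <- max(surface) / min(surface)

# The same values on the natural-log scale span a much narrower range
log_surface <- log(surface)
log_range <- diff(range(log_surface))
```

The width of the range on the log scale equals the log of the raw ratio, so a few very large buildings no longer dominate the colour scale of a map.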

School buildings database

library(magrittr)  # provides the %>% pipe used below

# School buildings database; 2022/23 is the latest year available
Input_DB23_MIUR <- Get_DB_MIUR(Year = 2023,
                               input_Registry = Registry23)
## then, remember to add message = FALSE

# Aggregate at the municipality level and add the log of the surface area
DB23_MIUR_mun <- Group_DB_MIUR(Input_DB23_MIUR,
                               InnerAreas = FALSE)$Municipality_data %>%
  dplyr::mutate(log_Surface = log(.data$School_area_surface))

head(DB23_MIUR_mun)

Mapping

DB23_MIUR_mun %>% 
  Map_School_Buildings(input_shp = Mun22_shp, field = "log_Surface", 
                       level = "LAU", order = "Middle",
                       region_code = c(16, 17, 18), verbose = FALSE)